This repository was archived by the owner on Sep 22, 2025. It is now read-only.

@Tanvir-ctrl1

Description

Sprint-2 adds classical NLP baselines for the Emotion/State notes dataset.

What’s included

  • New script: AI Guardian/emotional_baseline.py
  • Artifacts (in AI Guardian/out_final/):
    • sprint2_report.md (summary & metrics)
    • sprint2_results.csv (single split), sprint2_cv_results.csv, sprint2_cv_summary.csv
    • confusion_matrix.png
    • feature_preview.txt (evidence of no label tokens in vocab)
    • settings.json (repro config)

Methods

  • Features: BoW(1), TF-IDF(1–2), TF-IDF(1–3) + optional lexicon counts
  • Models: Logistic Regression & Linear SVM (class_weight="balanced")
  • Leakage guard: removes label tokens (normal/sick/uncomfortable + variants) via custom preprocessor + stopword list, with a tripwire that fails if any leak into the vocab
  • Group-aware split (GroupShuffleSplit) + 5-fold GroupKFold CV; dedup after scrubbing
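The leakage guard, dedup-after-scrubbing, and group-aware holdout listed above could look roughly like the following. This is a minimal standard-library sketch, not the code in emotional_baseline.py: `LABEL_PATTERN`, `scrub`, `build_vocab`, `assert_no_label_tokens`, `group_holdout`, and the sample rows are all hypothetical names and data, and the actual script relies on scikit-learn's GroupShuffleSplit and GroupKFold rather than a hand-rolled split.

```python
import random
import re

# Hypothetical label pattern: the PR names normal/sick/uncomfortable "+ variants";
# the suffixes below are only a guess at what those variants are.
LABEL_PATTERN = re.compile(
    r"\b(?:normal(?:ly)?|sick(?:ness|ly)?|uncomfortabl[ey])\b"
)
TOKEN_PATTERN = re.compile(r"[a-z]+")


def scrub(text):
    """Lowercase a note and remove label tokens before vectorizing."""
    cleaned = LABEL_PATTERN.sub(" ", text.lower())
    return " ".join(cleaned.split())


def build_vocab(texts):
    """Collect the unigram vocabulary of already-scrubbed texts."""
    return {tok for text in texts for tok in TOKEN_PATTERN.findall(text)}


def assert_no_label_tokens(vocab):
    """Tripwire: raise if any label token made it into the vocabulary."""
    leaked = sorted(tok for tok in vocab if LABEL_PATTERN.fullmatch(tok))
    if leaked:
        raise ValueError(f"label tokens leaked into vocab: {leaked}")


def group_holdout(groups, test_fraction=0.25, seed=0):
    """Hold out whole groups so no group appears in both train and test."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    held_out = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in held_out]
    test_idx = [i for i, g in enumerate(groups) if g in held_out]
    return train_idx, test_idx


# Made-up (group, note) rows purely for illustration.
rows = [
    ("u1", "Feeling sick and dizzy today"),
    ("u1", "Feeling sickly and dizzy today"),
    ("u2", "Slept well, normal day"),
    ("u3", "Uncomfortable chair, back hurts"),
    ("u4", "Back hurts a bit"),
]

# Scrub first, then dedup: the two u1 notes only become identical after scrubbing.
seen = set()
groups, scrubbed = [], []
for group, note in rows:
    text = scrub(note)
    if text not in seen:
        seen.add(text)
        groups.append(group)
        scrubbed.append(text)

vocab = build_vocab(scrubbed)
assert_no_label_tokens(vocab)  # passes silently when the scrub worked
train_idx, test_idx = group_holdout(groups)
```

Deduplicating after scrubbing matters because two notes that differ only in a label word collapse to the same text once the label is removed; left in place, such pairs could land on both sides of a split.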

Headline results (use CV as primary)

  • Best CV: TF-IDF(1–2) + LinearSVM: 0.98 accuracy, 0.956 ± 0.032 macro-F1
  • Single split shown in confusion_matrix.png (informational)
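For reviewers reading the numbers above: macro-F1 averages the per-class F1 scores with equal weight, so a weak minority class pulls it below accuracy. The sketch below shows that calculation and a mean ± spread summary over folds using only the standard library. The labels and predictions are made up, `macro_f1` and `summarize_folds` are hypothetical helpers, and the use of population standard deviation is an assumption; the PR does not say how the ± value was computed.

```python
from statistics import mean, pstdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def summarize_folds(fold_scores):
    """Return (mean, population std) of per-fold scores."""
    return mean(fold_scores), pstdev(fold_scores)


# Toy data for illustration only; these are not the PR's predictions.
y_true = ["normal", "normal", "sick", "sick", "uncomfortable", "uncomfortable"]
y_pred = ["normal", "normal", "sick", "uncomfortable", "uncomfortable", "uncomfortable"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro-F1={score:.3f}")
```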

Todos

  • Tested and working locally (sprint2nlp conda env)
  • Code follows project style (black-ish formatting, no notebooks committed for pipeline)
  • Self-reviewed; checked feature preview for label leakage
  • Documentation: sprint2_report.md + settings.json
  • Request review from 2 devs (ML + app)

How to test

  1. Ensure the conda env has scikit-learn, pandas, numpy, scipy, and matplotlib installed.
    conda activate sprint2nlp
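
One way to confirm that the activated environment actually has those packages is a quick import check. This is a hypothetical helper, not part of the PR; `REQUIRED` and `missing_packages` are illustrative names.

```python
from importlib.util import find_spec

# Import names for the packages listed in step 1 (scikit-learn imports as "sklearn").
REQUIRED = ["sklearn", "pandas", "numpy", "scipy", "matplotlib"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```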
